Week 7 – Confidence Intervals for the Slope

This week’s reading is a compilation of Chapter 8 from ModernDive (Kim et al. 2020) and Chapter 24 from Introduction to Modern Statistics (Çetinkaya-Rundel and Hardin 2023), with a smattering of my own ideas.

0.1 Reading Guide

Download the reading guide as a Word Document here

Download the reading guide as an HTML file here

0.2 Sampling Review

In the last reading, we studied the concept of sampling variation. Using the example of estimating the proportion of red balls in a bowl, we started with a “tactile” exercise where a shovel was used to draw a sample of balls from the bowl. While we could have performed an exhaustive count of all the balls in the bowl, this would have been a tedious process. So instead, we used a shovel to extract a sample of balls and used the resulting proportion that were red as an estimate. Furthermore, we made sure to mix the bowl’s contents before every use of the shovel. Because of the randomness created by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the bowl’s balls that are red.

We then used R to mimic this “tactile” sampling process. Using our computer’s random number generator, we were able to quickly mimic the tactile sampling procedure a large number of times. Moreover, we were able to explore how different our results would be if we used different-sized shovels, with 25, 50, and 100 slots. When we visualized the results of these three different shovel sizes, we saw that as the sample size increased, the variation in the estimates (\(\widehat{p}\)) decreased.

These visualizations of the repeated sampling from the bowl have a special name in Statistics – a sampling distribution. These distributions allow us to study how our estimates (\(\widehat{p}\)) varied from one sample to another; in other words, we wanted to study the effect of sampling variation. Once we had over 1,000 different estimates, we quantified the variation of these estimates using their standard deviation, which also has a special name in Statistics – the standard error. Visually, we saw the spread of the sampling distributions get narrower as the sample size increased, which was reiterated by the standard errors – the standard errors decreased as the sample size increased. This decrease in the spread of the sampling distribution gave us more precise estimates that varied less around the center.
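The simulation summarized above can be sketched in a few lines of code. This is a minimal version, assuming the moderndive package’s bowl data frame (with its color column) and the rep_sample_n() function from the reading; it computes the standard error of \(\widehat{p}\) for each of the three shovel sizes:

```r
library(dplyr)
library(moderndive)  # provides the bowl data and rep_sample_n()

# For each shovel size, take 1,000 virtual samples and compute the
# standard deviation of the 1,000 sample proportions -- the standard error
for (shovel_size in c(25, 50, 100)) {
  se <- bowl %>%
    rep_sample_n(size = shovel_size, reps = 1000) %>%
    group_by(replicate) %>%
    summarize(prop_red = mean(color == "red")) %>%
    summarize(se = sd(prop_red)) %>%
    pull(se)
  cat("n =", shovel_size, " SE =", round(se, 3), "\n")
}
```

Running this reproduces the pattern described above: the printed standard error shrinks as the shovel size grows.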

We then tied these sampling concepts to the statistical terminology and mathematical notation related to sampling. Our study population was the large bowl with \(N\) = 2400 balls, while the population parameter, the unknown quantity of interest, was the population proportion \(p\) of the bowl’s balls that were red. Since performing a census would be expensive in terms of time and energy, we instead extracted a sample of size \(n\) = 50. The point estimate, also known as a sample statistic, used to estimate \(p\) was the sample proportion \(\widehat{p}\) of these 50 sampled balls that were red. Furthermore, since the sample was obtained at random, it can be considered as unbiased and representative of the population. Thus any results based on the sample could be generalized to the population. Therefore, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the bowl’s balls that are red. In other words, we used the sample to infer about the population.

However, we acknowledged that both the tactile and virtual sampling exercises are not what one would do in real life; this was merely an activity used to study the effects of sampling variation. In a real-life situation, we would not take 1,000s of samples of size \(n\), but rather take a single representative sample that’s as large as possible. Additionally, we knew that the true proportion of the bowl’s balls that were red was 37.5%. In a real-life situation, we will not know what this value is. After all, if we did, why would we take a sample to estimate it?

So how does one quantify the effects of sampling variation when you only have a single sample to work with? You cannot directly study the effects of sampling variation when you only have one sample. One common method to study this is bootstrap resampling, which will be the focus of the earlier sections of this reading.

Furthermore, what if we would like not only a single estimate of the unknown population parameter, but also a range of highly plausible values? For example, when you read about political polls, they tell you the percent of all Californians who support a specific measure, but in addition to this estimate they provide the poll’s “margin of error”. This margin of error can be used to construct a range of plausible values for the true percentage of people who support a specific measure. This range of plausible values is what’s known as a confidence interval, which will be the focus of the later sections of this reading.
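For instance, using hypothetical numbers (not from an actual poll): if a poll estimates that 52% of Californians support a measure with a margin of error of 3 percentage points, the range of plausible values for the true percentage is

\[ 52\% \pm 3\% \quad \Rightarrow \quad (49\%, \ 55\%) \]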

1 Baby Birth Weights

Medical researchers may be interested in the relationship between baby birth weight and the age of the mother, so as to provide medical interventions for specific age groups if it is found that they are associated with lower birth weights.

Every year, the US releases to the public a large data set containing information on births recorded in the country. The births14 dataset (http://openintrostat.github.io/openintro/reference/births14.html) is a random sample of 1,000 cases from one such dataset released in 2014.

1.1 Observed data

Figure 1 visualizes the relationship between mage and weight for this sample of 1,000 birth records.

ggplot(data = births14, 
       mapping = aes(x = mage, y = weight)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Mother's Age", 
       y = "Birth Weight of Baby (lbs)")

Figure 1: Weight of baby at birth (in lbs) as explained by mother’s age.

Table 1 displays the estimated regression coefficients for modeling the relationship between mage and weight for this sample of 1,000 birth records.

births_lm <- lm(weight ~ mage, data = births14)

get_regression_table(births_lm)
Table 1: The least squares estimates of the intercept and slope for modeling the relationship between birth weight and mother’s age.
term estimate std_error statistic p_value lower_ci upper_ci
intercept 6.793 0.208 32.651 0.000 6.385 7.201
mage 0.014 0.007 1.987 0.047 0.000 0.028

Based on these coefficients, the estimated regression equation is:

\[ \widehat{\text{birth weight}} = 6.793 + 0.014 \times \text{mother's age}\]

Note

We will let \(\beta_1\) represent the slope of the relationship between baby birth weight and mother’s age for every baby born in the US in 2014. We will estimate \(\beta_1\) using the births14 dataset, labeling the estimate \(b_1\) (just as we did in Week 4).

Danger

A parameter is the value of the statistic of interest for the entire population.

We typically estimate the parameter using a “point estimate” from a sample of data. The point estimate is also referred to as the statistic.

1.2 Variability of the statistic

This sample of 1,000 births is only one of possibly tens of thousands of possible samples that could have been taken from the large dataset released in 2014. So, then we might wonder how different our regression equation would be if we had a different sample. There is no reason to believe that \(\beta_1\) is exactly 0.014, but there is also no reason to believe that \(\beta_1\) is particularly far away from \(b_1 =\) 0.014.

Just this week you read about how estimates, such as \(b_1\), are prone to sampling variation – the variability from sample to sample. For example, if we took a different sample of 1,000 births, would we obtain a slope of exactly 0.014? No, that seems fairly unlikely. We might obtain a slope of 0.015 or 0.012, or even 0.035!

When we studied the effects of sampling variation, we took many samples, something that was easily done with a shovel and a bowl of red and white balls. In this case, however, how would we obtain another sample? Well, we would need to go to the source of the data—the large public dataset released in 2014—and take another random sample of 1,000 observations. Maybe we don’t have access to that original dataset of 2014 births, how could we study the effects of sampling variation using our single sample? We will do so using a technique known as bootstrap resampling with replacement, which we now illustrate.

1.3 Resampling once

Figure 2: Step 1: Write out mother’s ages and birth weights on 1,000 slips of paper representing one of the 1,000 births included in the original sample.

Step 1: Print out 1,000 identically sized slips of paper (or post-it notes) representing the 1,000 babies in our sample. On each piece of paper, write the mother’s age and the birth weight of the baby. Figure 2 displays six of these slips.

Step 2: Put the 1,000 slips of paper into a hat as seen in Figure 3.

Figure 3: Step 2: Putting 1,000 slips of paper (post-its) in a hat.

Step 3: Mix the hat’s contents and draw one slip of paper at random, as seen in Figure 4. Record the mother’s age and birth weight, as printed on the paper.

Figure 4: Step 3: Drawing one slip of paper at random.

Step 4: Put the slip of paper back in the hat! In other words, replace it as seen in Figure 5.

Figure 5: Step 4: Replacing slip of paper.

Step 5: Repeat Steps 3 and 4 a total of 999 more times, resulting in 1,000 recorded mother’s ages and baby birth weights.

What we just performed was a resampling of the original sample of 1,000 birth weights. We are not sampling 1,000 birth weights from the population of all 2014 US births as we did for our original sample. Instead, we are mimicking this process by resampling 1,000 birth weights from our original sample.

Now ask yourselves, why did we replace our resampled slip of paper back into the hat in Step 4? Because if we left the slip of paper out of the hat each time we performed Step 4, we would end up with the same 1,000 birth weights! In other words, replacing the slips of paper induces sampling variation.

Being more precise with our terminology, we just performed a resampling with replacement from the original sample of 1,000 birth weights. Had we left the slip of paper out of the hat each time we performed Step 4, this would be resampling without replacement.

Let’s study our sample of 1,000 resampled birth weights and mother’s ages. First, I’ve made a table of the number of times each observation (birth record) was resampled. Table 2 displays the birth records that were resampled the most often. Based on the table, it appears that record IDs 71 and 534 were resampled six times.

Table 2: Frequencies of how often a given birth record (ID) was resampled.
ID:    71  534  170  561   19  142  305  309  495  573  726  727  835  838  840  864  919  924  929  953
Times:  6    6    5    5    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4
Remember

When sampling with replacement, not every observation is guaranteed to be sampled. So, some observations from the original sample may never appear in the resample, whereas others may appear multiple times.
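Both the frequency table and the note above can be checked in code. Here is a sketch, assuming the resample is stored in a data frame called resample1 that keeps the ID column from the original data:

```r
library(dplyr)

# How often was each original birth record resampled?
resample_counts <- resample1 %>%
  count(ID, sort = TRUE)

# Fraction of the 1,000 original records that never appear in the resample.
# Theory says roughly (1 - 1/1000)^1000, i.e. about 1/e = 36.8%
n_never <- 1000 - nrow(resample_counts)
n_never / 1000
```

The first few rows of resample_counts reproduce a table like Table 2, and the final fraction shows how many original observations were left out of this particular resample.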

Figure 6 compares the relationship between mother’s age and baby birth weight from our resample with the relationship in our original sample.

ggplot(data = births14, 
       mapping = aes(x = mage, y = weight)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Mother's Age", 
       y = "Birth Weight (lbs)")

ggplot(data = resample1, 
       mapping = aes(x = mage, y = weight)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Mother's Age", 
       y = "Birth Weight (lbs)")

(a) Original Sample of 1,000 Birth Records

(b) Resample of 1,000 Birth Records

Figure 6: Comparing relationship between mage and weight in the resampled birth records compared to the relationship seen in the original sample of birth records.

Observe in Figure 6 that while the general shape of both relationships is similar, they are not identical.

Recall from the previous section that the sample slope (\(b_1\)) from the original sample was 0.014. What about for our resample? Based on the scatterplot, what would your guess be? Larger than before? Smaller than before?

Let’s look at the coefficient estimates for the resampled dataset. Table 3 displays the estimated regression coefficients for modeling the relationship between mage and weight for the resample of 1,000 birth records.

resample_lm <- lm(weight ~ mage, data = resample1)

get_regression_table(resample_lm)
Table 3: The least squares estimates of the intercept and slope for modeling the relationship between birth weight and mother’s age, for the resample of 1,000 birth records.
term estimate std_error statistic p_value lower_ci upper_ci
intercept 6.981 0.220 31.738 0.000 6.549 7.413
mage 0.005 0.008 0.684 0.494 -0.010 0.020

For the resampled dataset, the relationship between mother’s age and baby birth weight is much weaker than in the original sample, with an estimated slope of \(b_1 =\) 0.005.

What if we repeated this resampling exercise many times? Would we obtain the same slope each time? In other words, would our guess at the slope for the relationship between mother’s age and birth weight for all births in the US in 2014 be exactly 0.005 every time? Just as we did in the last chapter, let’s perform this resampling activity with the help of some of our friends: 35 friends in total.

2 Computer simulation and resampling

It should be very clear that tactile resampling with a dataset of 1,000 observations would be extremely time-consuming – nothing we would want to ask our friends to do. A computer, however, would be happy to do this process for us!

2.1 Virtually resampling once

First, let’s perform the virtual analog of resampling once. Recall that the births14 dataset included in the openintro package contains the 1,000 birth records from the original study. Furthermore, recall from the last chapter that we used the rep_sample_n() function as a virtual shovel to sample balls from our virtual bowl of 2400 balls.

Let’s modify this code to perform the resampling with replacement from the 1,000 birth records in the original sample:

virtual_resample <- rep_sample_n(births14, 
                                 size = 1000,
                                 replace = TRUE, 
                                 reps = 1)

Observe how we explicitly set the replace argument to TRUE in order to tell rep_sample_n() that we would like to sample birth records with replacement. Had we instead left replace at its default value of FALSE, we would have done resampling without replacement. Additionally, we changed the sample size to 1,000, since the resample needs to be the same size as the original sample.

Let’s look at only the first 10 out of 1000 rows of virtual_resample:

virtual_resample
# A tibble: 1,000 × 15
# Groups:   replicate [1]
   replicate    ID  mage weight  fage mature  weeks premie visits gained lowbi…¹
       <int> <int> <dbl>  <dbl> <int> <chr>   <dbl> <chr>   <dbl>  <dbl> <chr>  
 1         1   333    29   7.4     31 younge…    39 full …     13     50 not low
 2         1   595    29   0.75    NA younge…    21 premie      4     NA low    
 3         1   338    31   6.94    NA younge…    39 full …      9     20 not low
 4         1   531    33   6.94    37 younge…    39 full …     14     20 not low
 5         1   289    35   7.98    35 mature…    38 full …     10     40 not low
 6         1   461    32   7.15    33 younge…    39 full …     13     NA not low
 7         1   137    26   7.69    36 younge…    38 full …     NA     25 not low
 8         1   828    36   4.54    65 mature…    36 premie     11     10 low    
 9         1   118    26   8.31    24 younge…    37 full …     12     25 not low
10         1    95    38   6.04    37 mature…    36 premie     10     32 not low
# … with 990 more rows, 4 more variables: sex <chr>, habit <chr>,
#   marital <chr>, whitemom <chr>, and abbreviated variable name
#   ¹​lowbirthweight

The replicate variable only takes on the value of 1 corresponding to us only having reps = 1, the ID variable indicates which of the 1,000 birth records was resampled, and mage and weight denote mother’s age and the baby’s birth weight, respectively.

Let’s now compute the slope for the relationship between mother’s age and baby’s birth weight for this resample:

virtual_resample_lm <- lm(weight ~ mage, data = virtual_resample)

get_regression_table(virtual_resample_lm)
Table 4: The least squares estimates of the intercept and slope for modeling the relationship between birth weight and mother’s age, for the virtual resample of 1,000 birth records.
term estimate std_error statistic p_value lower_ci upper_ci
intercept 7.156 0.224 31.974 0.000 6.716 7.595
mage 0.002 0.008 0.210 0.834 -0.014 0.017

As with our tactile resampling exercise, Table 4 shows that the estimated slope (0.002) is different from the slope of our original sample of 0.014.

2.2 infer package workflow

Unfortunately, our process of virtual resampling relies on us fitting a linear regression for each replicate of our virtual_resample dataset. This gets a bit tricky coding-wise, as we would need to fit 35 different linear regressions if we had 35 different resamples of our data.
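To see why this is tricky, here is one manual approach (a sketch, using a hypothetical many_resamples data frame produced by rep_sample_n() with reps = 35). It relies on the fact that the least squares slope equals \(\text{cov}(x, y) / \text{var}(x)\), which lets us avoid calling lm() 35 times, but the code is neither obvious nor self-documenting:

```r
library(dplyr)

# The least squares slope is b1 = cov(x, y) / var(x),
# computed separately within each replicate
many_resamples %>%
  group_by(replicate) %>%
  summarize(slope = cov(mage, weight) / var(mage))
```

The infer workflow below accomplishes the same thing with verbs whose names describe each step.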

Enter, infer, an R package for statistical inference. infer makes efficient use of the %>% pipe operator we learned in Week 3 to spell out the sequence of steps necessary to perform statistical inference in a “tidy” and transparent fashion. Furthermore, just as the dplyr package provides functions with verb-like names to perform data wrangling, the infer package provides functions with intuitive verb-like names to perform statistical inference.

Let’s go back to our original slope. Previously, we computed the value of the sample slope \(b_1\) using the lm() function:

births_lm <- lm(weight ~ mage, data = births14)

get_regression_table(births_lm)

We’ll see that we can also do this using infer functions specify() and calculate():

virtual_resample %>% 
  specify(response = weight, 
          explanatory = mage) %>% 
  calculate(stat = "slope")

You might be asking yourself: “Isn’t the infer code longer? Why would I use that code?”. While not immediately apparent, you’ll see that there are three chief benefits to the infer workflow as opposed to the lm() function workflow we had previously.

First, the infer verb names better align with the overall resampling framework you need to understand to construct confidence intervals and to conduct hypothesis tests. We’ll see flowchart diagrams of this framework in the upcoming Figure 12.

Second, you can jump back and forth seamlessly between confidence intervals and hypothesis testing with minimal changes to your code. This will become apparent next week, when we also use infer to conduct hypothesis tests for the slope statistic.

Third, the infer workflow is much simpler for conducting inference when you have more than one variable, meaning we can extend what we’ve learned for simple linear regression to multiple linear regression models.

Let’s now illustrate the sequence of verbs necessary to construct a confidence interval for \(\beta_1\), the slope of the relationship between mother’s age and baby’s birth weight for all US births in 2014.

1. specify variables

Figure 7: “Diagram of the specify() verb.”

As shown in Figure 7, the specify() function is used to choose which variables in a data frame will be the focus of our statistical inference. We do this by specifying the response argument. For example, in our births14 data frame, the response variable of interest is weight and the explanatory variable is mage:

births14 %>% 
  specify(response = weight, 
          explanatory = mage)
Response: weight (numeric)
Explanatory: mage (numeric)
# A tibble: 1,000 × 2
   weight  mage
    <dbl> <dbl>
 1   6.96    34
 2   8.86    31
 3   7.51    36
 4   6.19    16
 5   6.75    31
 6   6.69    26
 7   6.13    36
 8   6.74    24
 9   8.94    32
10   9.12    26
# … with 990 more rows

Notice how the dataset got smaller; now there are only two columns where before there were 14. You should also notice the messages above the dataset (Response: weight (numeric) and Explanatory: mage (numeric)). These are meta-data about the structure of the dataset, declaring which variables have been assigned as the explanatory and response variables.

Note

This is similar to how the group_by() verb from dplyr doesn’t change the data, but only adds “grouping” meta-data, as we saw in Week 3.

2. generate replicates

Figure 8: “Diagram of generate() replicates.”

After we specify() the variables of interest, we pipe the results into the generate() function to generate replicates. Figure 8 shows how this is combined with specify() to start the pipeline. In other words, repeat the resampling process a large number of times, similar to how we collected 35 different samples of balls from the bowl.

The generate() function’s first argument is reps, which sets the number of replicates we would like to generate. Suppose we were interested in obtaining 50 different resamples (each of 1000 birth records). Then we would set reps = 50, telling infer that we are interested in obtaining 50 different resamples, each with 1000 observations.

The second argument type determines the type of computer simulation we’d like to perform. We set this to type = "bootstrap" indicating that we want to perform bootstrap resampling, meaning the resampling should be done with replacement. You’ll see different options for type when we learn about hypothesis testing next week.

births14 %>% 
  specify(response = weight, 
          explanatory = mage) %>% 
  generate(reps = 50, 
           type = "bootstrap")
Response: weight (numeric)
Explanatory: mage (numeric)
# A tibble: 50,000 × 3
# Groups:   replicate [50]
   replicate weight  mage
       <int>  <dbl> <dbl>
 1         1  10.6     31
 2         1   4.31    31
 3         1   4.94    33
 4         1   5.65    36
 5         1   8       23
 6         1   7.63    37
 7         1   8       19
 8         1   3.22    30
 9         1   8.56    32
10         1   4.32    21
# … with 49,990 more rows

Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 1000 birth records with replacement 50 times and 1000 \(\times\) 50 = 50,000.

The variable replicate indicates which resample each row belongs to: it takes the value 1 for the first 1,000 rows, the value 2 for the next 1,000 rows, and so on up to the value 50.

Comparing with original workflow: Note that the steps of the infer workflow so far produce the same results as the original workflow using the rep_sample_n() function we saw earlier. In other words, the following two code chunks produce similar results:

infer workflow

births14 %>%
  specify(response = weight, 
          explanatory = mage) %>% 
  generate(reps = 50, 
           type = "bootstrap") 

Original workflow

rep_sample_n(births14, 
             size = 1000, 
             replace = TRUE,
             reps = 50)

3. calculate summary statistics

Figure 9: Diagram of calculate() summary statistics.

After we generate() many replicates of bootstrap resampling with replacement, we next want to summarize each of the 50 resamples of size 1000 to a single sample statistic value. As seen in Figure 9, the calculate() function does this.

In our case, we want to calculate the slope between mother’s age and baby’s birth weight for each bootstrap resample of size 1000. To do so, we set the stat argument to "slope".

Tip

You can also set the stat argument to a variety of other common summary statistics, like "median", "sum", "sd" (standard deviation), and "prop" (proportion). To see a list of all possible summary statistics you can use, type ?calculate and read the help file.

Let’s save the result in a data frame called bootstrap_distribution and explore its contents:

bootstrap_distribution <- births14 %>% 
  specify(response = weight, 
          explanatory = mage) %>% 
  generate(reps = 50, 
           type = "bootstrap") %>% 
  calculate(stat = "slope")

bootstrap_distribution
Response: weight (numeric)
Explanatory: mage (numeric)
# A tibble: 50 × 2
   replicate    stat
       <int>   <dbl>
 1         1 0.0328 
 2         2 0.0181 
 3         3 0.0123 
 4         4 0.0106 
 5         5 0.0183 
 6         6 0.0190 
 7         7 0.0145 
 8         8 0.00824
 9         9 0.0155 
10        10 0.0160 
# … with 40 more rows

Observe that the resulting data frame has 50 rows and two columns. The replicate column corresponds to the 50 replicates and the stat column corresponds to the estimated slope for each resample.
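As a preview of the standard error method discussed later in this reading, the standard deviation of these 50 bootstrap slopes is itself an estimate of the standard error of \(b_1\). A one-line sketch, assuming dplyr is loaded and bootstrap_distribution is the data frame built above:

```r
library(dplyr)

# The standard deviation of the bootstrap statistics estimates
# the standard error of the sample slope
bootstrap_distribution %>% 
  summarize(se = sd(stat))
```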

4. visualize the results

Figure 10: Diagram of visualize() results.

The visualize() verb provides a quick way to visualize the bootstrap distribution as a histogram of the numerical stat variable’s values; the result is shown in Figure 11. The full pipeline of the main infer verbs used for exploring bootstrap distribution results is diagrammed in Figure 12.

visualize(bootstrap_distribution)

Figure 11: Bootstrap distribution of slope statistics from 50 bootstrap resamples.

Tip

The visualize() function can take many other arguments which we’ll see momentarily to customize the plot further. It also works with helper functions to do the shading of the histogram values corresponding to the confidence interval values.

Comparing with original workflow: In fact, visualize() is a wrapper function for the ggplot() function that uses a geom_histogram() layer. That’s a lot of fancy language which means that the visualize() function does the same thing as we did previously with ggplot(), just with fewer steps.

infer workflow

visualize(bootstrap_distribution)

Original workflow

ggplot(data = bootstrap_distribution,
       mapping = aes(x = stat)) +
  geom_histogram()
Danger

Careful! It might be tempting to ditch the ggplot() code altogether now that you know of a simpler approach. However, the visualize() function only works for a specific case – a data frame containing a distribution of calculate()d statistics.

Let’s recap the steps of the infer workflow for constructing a bootstrap distribution and then visualizing it in Figure 12.

Figure 12: infer package workflow for resampling

2.3 Virtually resampling 1000 times

2.4 Connection to sampling distributions

3 Understanding confidence intervals

3.1 Percentile method

3.2 Standard error method

4 Constructing confidence intervals

Recall how we introduced two different methods for constructing 95% confidence intervals for an unknown population parameter in Section 3: the percentile method and the standard error method. Let’s now check out the infer package code that explicitly constructs them. There are also some neat functions built into the infer package for visualizing the resulting confidence intervals!
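As a preview of both methods, here is a minimal sketch using infer’s get_confidence_interval() function. This assumes the bootstrap_distribution data frame built earlier; in practice you would use far more than 50 replicates:

```r
library(infer)

# Percentile method: take the middle 95% of the bootstrap distribution
percentile_ci <- bootstrap_distribution %>% 
  get_confidence_interval(level = 0.95, type = "percentile")

# Standard error method: observed slope plus/minus a multiple of the
# bootstrap standard error, so we first need the observed slope
obs_slope <- births14 %>% 
  specify(response = weight, explanatory = mage) %>% 
  calculate(stat = "slope")

se_ci <- bootstrap_distribution %>% 
  get_confidence_interval(level = 0.95, type = "se", 
                          point_estimate = obs_slope)

# Shade the percentile interval on the bootstrap histogram
visualize(bootstrap_distribution) + 
  shade_confidence_interval(endpoints = percentile_ci)
```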

Percentile method with infer

Standard error method with infer

5 Interpreting confidence intervals

5.1 Did the net capture the fish?

5.2 Precise and shorthand interpretation

5.3 Width of confidence intervals

Çetinkaya-Rundel, Mine, and Jo Hardin. 2023. Introduction to Modern Statistics. OpenIntro. https://openintro-ims.netlify.app/.
Kim, Albert Y., Chester Ismay, Starry Zhou, Marium Tapal, Dorothy Bishop, WatanabeSmith, Tobias Rockel, et al. 2020. Moderndive/ModernDive_book: Cleaned Bookdown Code, Typo Fixes, New Tips & Tricks Page. Zenodo. https://doi.org/10.5281/ZENODO.3932023.